I always asked myself what affects the wine quality? Is it because it’s french/australian, or is it because of a certain processs or lack/availability of a certain chemical. This will be the perfect chance to explore and try to find some answers.
A note to self that the data belongs to the Portuguese “Vinho Verde” wine only. In addition, due to privacy and logistic issues, only physicochemical (inputs) and sensory (the output) variables are available; that is, there is no data about grape types, wine brand, wine selling price, etc. It’s a shame that these additional data are missing, but still I think interesting insights can be taken from this dataset, so let’s begin.
Ok, so except quality, the rest of the attributes are floating point numbers.
Some description for the columns in order of appearence (see more details here: https://s3.amazonaws.com/udacity-hosted-downloads/ud651/wineQualityInfo.txt):
I’ll start exploring the acidity of the wines. Four attributes contribute to this: acetic acid, citric acid, tartaric acid, and pH which is a dependent variable.
1 - fixed acidity: most acids involved with wine or fixed or nonvolatile (do not evaporate readily)
2 - volatile acidity: the amount of acetic acid in wine, which at too high of levels can lead to an unpleasant, vinegar taste
3 - citric acid: found in small quantities, citric acid can add ‘freshness’ and flavor to wines**
4 - pH: describes how acidic or basic a wine is on a scale from 0 (very acidic) to 14 (very basic); most wines are between 3-4 on the pH scale
A few things that can be observed from the above:
The acidity and the flavour that above acids give to the sample could very likely affect the quality. Need to further explore.
Residual sugar is the amount of sugar remaining after fermentation stops. It’s said it’s rare to find wines with less than 1 gram/liter and wines with greater than 45 grams/liter are considered sweet.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -0.2218 0.2304 0.7160 0.6432 0.9956 1.8180
The residual sugar distribution is quite long tailed and there is a big spike for lower values. Hence, I applied a log 10 transformation and it looks like a bimodal distribution now. I think this could show the nature of sugar better, and hence I’ll use the log10 values of sugar in later charts.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00900 0.03600 0.04300 0.04577 0.05000 0.34600
Salt distribution also looks long taield with quite a few samples above 75 quantile. But actually the majority of samples fall between 0.01 and 0.08, and those have a normal-looking distribution.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.00 23.00 34.00 35.31 46.00 289.00
Although 3rd quantile of free so2 distribution falls below 50, I wonder how those samples with valuse above 50 affect quality.
let’s look at the distributions:
Both distribution look like a normal distribution. There are a few outliers for both free and total SO2, but majority of samples fall between (0,100) for free SO2 and (0,300) for total SO2.
Total SO2 is a superset of free SO2 and obviously there should be a noticeable positvie correlation between both.
the density of sample is close to that of water depending on the percent alcohol and sugar content
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.9871 0.9917 0.9937 0.9940 0.9961 1.0390
Density range is very small and the changes are in 0.0001 level. Distirbution looks like normal and the (0.985,1.005) range seems to capture most of the samples.
Sulphate (SO4) is a wine additive which can contribute to sulfur dioxide gas (S02) levels, wich acts as an antimicrobial and antioxidant. Let’s move on with Sulphate level distirbution (SO4).
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.2200 0.4100 0.4700 0.4898 0.5500 1.0800
The median for SO4 distribution is 0.47 and distribution looks like normal. We’ll have to see if SO4 levels have correlation with other params and most importantly quality.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8.00 9.50 10.40 10.51 11.40 14.20
Alcohol value ranges from 8 to 14.2 percent. Although the mean and median values are quite close (10.51 and 10.4) the distribution doesn’t exactly look like a normal distribution.
I did a log10 transformation and i think it might be a better distribution for later analysis.
In this dataset, the judgment on quality is based on sensory data (median of at least 3 evaluations made by wine experts). Each expert graded the wine quality between 0 (very bad) and 10 (very excellent).
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3.000 5.000 6.000 5.878 6.000 9.000
We can see that median and 3rd quantile are both 6.0. There is no sample with rating below 3 nor above 9. Distribution also follows a nice bell-looking shape.
This is the most important parameter in the dataset and one of my main goals would be how other paramters affect quality.
After going through the analysis of the above variables, one thing that striked me is that the distribution for most variables looks bell-shaped.
Ok so beside the fact that acidity means low pH I don’t remember much from highschool chemistry. So let’s look at the data and see if we can find some relationship. I guess density should be another variable that is affected by impurities (anything other than water like alcohol).
There are two many variables and ggpairs is quite congested and can’t draw much information from it. So i’ll be happy with a correlation matrix.
Assuming that correlations above 0.3 are worth considering and above 0.7, I’ll further explore the relationship between each two variables using scatterplots.
Also to note that pH is mostly affected by Tartaric acid -0.42585829 (basic chemistry). The ineffectiveness of other acids on pH may be explainable by the fact that their range doesn’t vary much (as investigated earlier).
Free and total SO2 correlation: 0.6155009650 (explained by subset relationship, not interesting)
Notable correlation values with other parameters:
Density and sugar: 0.83896645
Density and alchohol: -0.78013762
Density and total SO2: 0.529881324
The way Density is affected by above components may be explainable by basic physics/chemistry as a natural cause and affect relationship. For example density of sugar is greater than 1 so the more it’s present the higher the density. For alcohol it’s a reverse correlation. I’m not particularly interested in pursuing this further.
SO2 and sugar correlation: 0.40143931
This can be explaned by the fact that the residual sugar is susceptible to attack by microorganisms which would cause a restart of fermentation. Hence they add SO2 as an antiseptic, depending on the sugar level.
This could indicate that we can expect the same correlation behaviour from sugar and SO2 against other variables.
Quality is probably the most interesting aspect of this dataset. At the end of the day it’s that taste that matter. So it will be definitely interesting to find patterns that could explain some relationship between quality and other components.
Notable correlation between alcohol and other attributes are:
Quality and salt (less than 0.3 but still): -0.20993441 Quality and density: -0.30712331 Quality and alcohol: 0.435574715
Let’s look at the relationship between quality and alcohol/density first.
## geom_smooth: na.rm = FALSE
## stat_smooth: na.rm = FALSE, method = lm, formula = y ~ x, se = TRUE
## position_identity
This chart indicates the positive correlation of quality with alcohol and its negative correlation with density. Alcohol and density are highly correlated (-0.78) so the top 2 charts look like a duality. We can summarise the relationship between the 3 like below:
The black dotted line indicates median of alcohol values.
It can be seen that for samples above orange line (alcohol levels above alcohol median), there is a higher concentration of the blue dots, implying higher quality.
The other attribute that has a notable correlation with quality is salt. Let’s see how the relationship looks like:
Ok so this chart indeed suggests that there is a negative correlation between quality and salt.
From dataset information, we can observe that: - citric acid could add freshness to taste - high levels of acetic acid can lead to an unpleasant, vinegar taste - SO2 is mostly undetectable in wine, but at free SO2 concentrations over 50 ppm, SO2 becomes evident in the nose and taste of wine - sugar and acidity are two key taste factors
So although correlation between quality and these few parameters is low, they may explain certain patterns in the data. Let’s see if we can find something that explains the variance in quality<>alcohol relationship.
Ok so the most obvious pattern is that of sugar. The area above smoother line has a higher red concentration compared to bottom part of chart which is more greenish. This can be explained by the negative correlation of sugar and alcohol. There’s a bit of similar observable pattern for salt and also for SO2. Similarity of SO2 pattern can be explained by their positive correlation as established earlier.
I can’t really draw a conclusion for acetic acid, citric acid, and pH.
Having studied this dataset, we tried to find relationships/correlations between data. Here I present the 3 most interesting results of my findings.
The black dotted line indicates the alcohol median.
This chart shows both the correlation between alcohol and density, and also a trend that higher alcohol (and thus lower density) has a positive correlation with quality. This can be recognised by the bluish hue of dots in upper left. This is the most obvious pattern in this dataset.
The black dotted line indicates the salt median.
This chart suggests that the higher the sugar level and salt level, the less the alcohol; so the top right of the chart has a more red concentration. Salt and sugar are both negatively correlated to alcohol, but the correlation among themselves is not quite obvious.
This chart is specially important since it establishes the relationship between the 3 key influential attributes of the dataset: salt, alcohol, and sugar.
This chart combines the previous key obsevations. Besides showing the positive correlation between quality and alcohol, this chart also shows how sugar and salt levels can be used to explain the variance in alcohol levels. The area above the smoother line has a red-ish hue while the bottom area has a more concentration of green dots.
Probably the biggest challenge in analysing this dataset was lack of background knowledge in wine processign and the chemicals that are present and vary during this process. So I had to read up a bit and get myself comfortable with the chemical elements in the dataset.
The number of variables where also quite too many so figuring out which are relevant and which are less relevant took me some time. For example, density and pH could be considered as a dependent variable. SO4, citric acid, and acetic acid are quite neutral in terms of correlation with other parameters.
The most interesting variables in this dataset are thus quality, alcohol, salt and residual sugar. I belive this is one of the key findings of this study.
I further showed that the quality seems to be most sensetive to the alcohol level and also to salt & sugar, although to a lesser extent.
This may or may not be a cause and effect relationship. Let’s assume we’re not talking about manually increasing alcohol level or decreasing sugar/salt levels. But if adjusting the wine making process reasonably would have led to such changes, then it might indeed result in a beter rating. Of course this is something that needs to be further explored and it’s important to note that our findings are in the context of Portuguese “Vinho Verde” white wines.
The latter is actually an import thing to note. Our analysis is based on a single brand of white wine. The study would have been more rich if we had data from other brands. For example, australian wines might have a different growing environments for grapes which could have affected the chemicals. Generally I think having more diversity would help us identify patterns which might have been undetected.
I’m also intersted to know how chemical properties of white wine compare to those of red wine. So a natural step for this study would be to include Portuguese “Vinho Verde” red wine data and try to draw comparisons between the two types of wine.